4.9. Uploading Efficiently Using Blocks
azbackup isn’t only about cryptography and security—it is also
about providing a good backup experience (after all, it has the word
backup in its name). The straightforward way to
back up encrypted data to the cloud is to initiate a “Create blob”
operation and start uploading data.
However, there are two downsides to doing things this way. First,
uploads are limited to 64 MB with a single request. Backups of huge
directories will often be larger than 64 MB. Second, making one long
request means not only that you’re not making use of all the bandwidth
available to you, but also that you’ll have to restart from the
beginning if the request fails.
The first order of business is to add support in storage.py for adding a block and committing a
block list. Example 10
shows the code to do this.
Example 10. Block support in storage.py
def put_block(self, container_name, blob_name, block_id, data):
    # Take a block id and construct a URL-safe, base64 version
    base64_blockid = base64.encodestring(str(block_id)).strip()
    urlencoded_base64_blockid = urllib.quote(base64_blockid)

    # Make a PUT request with the block data to blob URI followed by
    # ?comp=block&blockid=<blockid>
    return self._do_store_request("/" + container_name + "/" +
                                  blob_name +
                                  "?comp=block&blockid=" +
                                  urlencoded_base64_blockid,
                                  'PUT', {}, data)

def put_block_list(self, container_name, blob_name,
                   block_list, content_type):
    headers = {}
    if content_type is not None:
        headers["Content-Type"] = content_type

    # Begin XML content
    xml_request = "<?xml version=\"1.0\" encoding=\"utf-8\"?><BlockList>"

    # Concatenate block ids into block list
    for block_id in block_list:
        xml_request += "<Block>" + \
            base64.encodestring(str(block_id)).strip() + "</Block>"

    xml_request += "</BlockList>"

    # Make a PUT request to blob URI followed by ?comp=blocklist
    return self._do_store_request("/" + container_name +
                                  "/" + blob_name +
                                  "?comp=blocklist", 'PUT',
                                  headers, xml_request)
We covered the XML and URI formats in detail earlier in this
chapter. Since the XML being constructed here is trivial and uses a
well-defined character range, the code hand-constructs it instead of
using Python’s XML support.
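For reference, for a blob built from two blocks, the xml_request string that put_block_list sends with ?comp=blocklist would look roughly like the following. It is pretty-printed here for readability (the code emits it on one line), and the IDs shown are placeholders for the base64-encoded block IDs azbackup generates.

<?xml version="1.0" encoding="utf-8"?>
<BlockList>
  <Block>base64-block-id-1</Block>
  <Block>base64-block-id-2</Block>
</BlockList>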
With this support in place, azbackup can now chop up the encrypted archive
into small blocks, and call the previous two functions to upload them.
Instead of uploading the blocks in sequence, they can be uploaded in
parallel, speeding up the process. Example 11
shows the entire code.
Example 11. The azbackup block upload code
def upload_archive(data, filename, account, key):
    conn = storage.Storage("blob.core.windows.net", account, key)

    # Try and create container. Will harmlessly fail if already exists
    conn.create_container("enc", False)

    # Heuristics for blocks
    # We're pretty hardcoded at the moment. We don't bother using blocks
    # for files less than 4MB.
    if len(data) < 4 * 1024 * 1024:
        resp = conn.put_blob("enc", filename, data,
                             "application/octet-stream")
    else:
        resp = upload_archive_using_blocks(data, filename, conn)

    if not (resp.status >= 200 and resp.status < 400):
        # Error! No error handling at the moment
        print resp.status, resp.reason, resp.read()
        sys.exit(1)

def upload_archive_using_blocks(data, filename, conn):
    blocklist = []
    queue = Queue.Queue()

    if parallel_upload:
        # parallel_upload specifies whether blocks should be uploaded
        # in parallel and is set from the command line.
        for i in range(num_threads):
            t = task.ThreadTask(queue)
            t.setDaemon(True)  # Run even without workitems
            t.start()

    offset = 0

    # Block uploader function used in thread queue
    def block_uploader(connection, block_id_to_upload,
                       block_data_to_upload):
        resp = connection.put_block("enc", filename, block_id_to_upload,
                                    block_data_to_upload)
        if not (resp.status >= 200 and resp.status < 400):
            print resp.status, resp.reason, resp.read()
            sys.exit(1)  # Need retry logic on error

    while True:
        if offset >= len(data):
            break

        # Get size of next block. Process in 4MB chunks
        data_to_process = min(4 * 1024 * 1024, len(data) - offset)

        # Slice off next block. Generate an SHA-256 block id
        # In the future, we could use it to see whether a block
        # already exists to avoid re-uploading it
        block_data = data[offset: offset + data_to_process]
        block_id = hashlib.sha256(block_data).hexdigest()
        blocklist.append(block_id)

        if parallel_upload:
            # Add work item to the queue.
            queue.put([block_uploader, [conn, block_id, block_data]])
        else:
            block_uploader(conn, block_id, block_data)

        # Move offset forward
        offset += data_to_process

    # Wait for all block uploads to finish
    queue.join()

    # Now upload block list
    resp = conn.put_block_list("enc", filename,
                               blocklist, "application/octet-stream")
    return resp
The action kicks off in upload_archive. If the input data is less than 4
MB, the code makes one long sequential request. If it is greater than 4
MB, the code calls a helper function to split and upload the data into
blocks. These numbers are chosen somewhat arbitrarily. In a real
application, you should test on your target hardware and network to see
what sizes and block splits work best for you.
The upload_archive_using_blocks function takes
care of splitting the input data into 4 MB blocks (again, another
arbitrary size chosen after minimal testing). For block IDs, an SHA-256
hash of the data in the block is used. Though the code doesn’t support
it as of this writing, it would be easy to add a feature that checks
whether a block of data already exists in the cloud (using the SHA-256
hash and the Get Block List operation)
before uploading it.
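To sketch what that could look like, the helper below compares locally computed SHA-256 block IDs against the IDs already stored in the cloud. It assumes a hypothetical get_block_list method on the storage object, which storage.py does not include as of this writing, so treat it as illustrative only.

import hashlib

# Illustrative only: assumes storage.Storage gains a hypothetical
# get_block_list(container, blob) method returning the block IDs
# already uploaded for the blob. storage.py doesn't have this yet.
def blocks_needing_upload(conn, filename, data, block_size=4 * 1024 * 1024):
    existing_ids = set(conn.get_block_list("enc", filename))  # hypothetical
    pending = []
    offset = 0
    while offset < len(data):
        block_data = data[offset: offset + block_size]
        block_id = hashlib.sha256(block_data).hexdigest()
        if block_id not in existing_ids:
            pending.append((block_id, block_data))
        offset += block_size
    return pending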
Each block is added to a queue that a pool of threads processes.
Since Python doesn’t have a built-in thread pool implementation, a
simple one that lives in task.py is
included in the source code. (The source isn’t shown here, since it
isn’t directly relevant to this discussion.) It manages a set of
threads that read work items off a queue and process them. Tweaking the
number of threads for your specific environment is important for good
upload performance.
In this case, the “work item” is a function reference (the inner
function block_uploader) and a list of
arguments to that function. When a work item is
processed, block_uploader gets called
with the arguments contained in that list (a storage connection object,
a block ID, and the data associated with that block ID). block_uploader then calls put_block in the storage module to upload that
specific block.
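For readers curious about what task.py might contain, a minimal worker thread along the following lines would do the job. This is a sketch rather than the actual source, which ships with azbackup but isn’t reproduced here.

import threading

# A minimal sketch of a worker thread like task.ThreadTask; the real
# implementation ships with azbackup but isn't shown in this chapter.
class ThreadTask(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Each work item is [function, [arguments]]; call the function
            # and then mark the item done so queue.join() can unblock.
            func, args = self.queue.get()
            try:
                func(*args)
            finally:
                self.queue.task_done()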
Uploading blocks in parallel not only provides better
performance, but also gives you the flexibility to extend this code later
to support retries on error, more sophisticated back-off strategies, and
several other features.
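As a rough illustration of the retry direction, block_uploader could be wrapped in a helper like the one below. azbackup doesn’t do this yet, and the exponential back-off shown is just one possible strategy.

import time

# Illustrative sketch: retry a block upload a few times with simple
# exponential back-off before giving up. Not part of azbackup today.
def put_block_with_retry(connection, container, filename, block_id,
                         block_data, attempts=3):
    for attempt in range(attempts):
        resp = connection.put_block(container, filename, block_id, block_data)
        if resp.status >= 200 and resp.status < 400:
            return resp
        # Back off before retrying: 1s, 2s, 4s, ...
        time.sleep(2 ** attempt)
    raise IOError("Block %s failed after %d attempts" % (block_id, attempts))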
5. Usage
All of this work wouldn’t be much fun if it weren’t useful and easy
to use, would it? Using azbackup is actually quite simple. It has a few
basic options (parsed using code not shown here), and the workflow is
fairly straightforward.
From start to finish, here are all the steps you take to back up
data to the cloud and restore encrypted backups:
1. Set the environment variables AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY to your Windows Azure
storage account name and key, respectively. For example, if your blob
storage account is at foo.blob.core.windows.net,
your AZURE_STORAGE_ACCOUNT should
be set to foo. The tool
automatically looks for these environment variables to connect to blob
storage.
2. Run python azbackup-gen-key
-k keyfilepath, where
keyfilepath is the path and filename where
your RSA key pairs will be stored.
Warning: Do not lose this file. If you do, you will
lose access to data backed up with this tool, and there’s no way to get
the data back.
3. To create a new backup, run python
azbackup.py -c -k keyfilepath
-f
archive_name
directory_to_be_backed_up, where
keyfilepath is the key file from the
previous step, archive_name is the name of the
archive that the tool will generate, and
directory_to_be_backed_up is the path of
the directory you want backed up. Depending on the size of the
directory, this might take some time, because the tool isn’t really
optimized for speed at the moment. If no errors are shown, the tool
will exit silently when the upload is finished.
4. To extract an existing backup, run python azbackup.py -x -k
keyfilepath -f archive_name.
This will extract the contents of the backup to the current
directory.
All tools take an -h
parameter to show usage information.